Skip to content

Improve testing of {Scheduler|Worker}MetricCollector#6945

Merged
fjetter merged 14 commits into
dask:mainfrom
hendrikmakait:improve-worker-metric-collector-testing
Aug 31, 2022
Merged

Improve testing of {Scheduler|Worker}MetricCollector#6945
fjetter merged 14 commits into
dask:mainfrom
hendrikmakait:improve-worker-metric-collector-testing

Conversation

@hendrikmakait

@hendrikmakait hendrikmakait commented Aug 24, 2022

Copy link
Copy Markdown
Member

Closes #6943.

  • Tests added / passed
  • Passes pre-commit run --all-files

@hendrikmakait hendrikmakait self-assigned this Aug 24, 2022
@github-actions

github-actions Bot commented Aug 24, 2022

Copy link
Copy Markdown
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±0         15 suites  ±0   6h 53m 10s ⏱️ + 10m 13s
  3 054 tests +1    2 968 ✔️  - 1    84 💤 ±0  2 +2 
22 592 runs  +7  21 610 ✔️ +7  980 💤  - 2  2 +2 

For more details on these failures, see this check.

Results for commit 51f9781. ± Comparison against base commit 817ead3.

♻️ This comment has been updated with latest results.

@hendrikmakait hendrikmakait marked this pull request as ready for review August 25, 2022 08:04
@hendrikmakait

Copy link
Copy Markdown
Member Author

Failing tests:

New ones that look similar to other mysterious test timeouts we've had:

  • test_nanny_death_timeout
  • test_num_fds

Comment thread distributed/http/worker/prometheus/core.py Outdated
PrometheusHandler._initialized = True
PrometheusHandler._collector = WorkerMetricCollector(self.server)
# Register collector
prometheus_client.REGISTRY.register(PrometheusHandler._collector)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are multiple concurrent Worker instances in the same process (very common with async workers in all gen_cluster tests), this means PrometheusHandler._collector will only be set to the WorkerMetricCollector of the last worker to start. Is that okay? Does the prometheus_client.REGISTRY.register(PrometheusHandler._collector) mean that all the collectors will be registered somewhere anywhere?

I don't know what either _collector or prometheus_client.REGISTRY are for, I'm just seeing some global variables getting overwritten in a place where I know there might not just be multiple workers sequentially, but also multiple workers in parallel.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this had been implemented before was that we'd simply set up a PrometheusHandler for the first worker spinning up in the process, set _initialized to True and call it a day. This led to test_prometheus_collect_task_states because that PrometheusHandler did not belong to the newly spun-up worker when running multiple tests. This new pattern surely isn't perfect as we only ever have a single instance of a PrometheusHandler per process talking to the last worker we spun up. However, IMO this is better than what we had before and it would require some more serious thinking to understand the Prometheus client to handle parallel workers gracefully. I'm happy to file a follow-up issue for that.

Regarding the singleton implementation: I've stolen that from https://github.com/hendrikmakait/distributed/blob/13e315c11c7277ba0cadd9f5c2a16364cdeaf14b/distributed/http/scheduler/prometheus/core.py#L79

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was pretty painful when we started testing this stuff. I'm OK with either solution and don't think we need anything more sophisticated for now.

As was already pointed out, this is merely an artifact for testing and no real world application would ever run multiple workers in the same process AND use prometheus. At the very least, this is something we then choose to not properly support

TLDR As long as the testing works out, I'm happy

port = s.http_server.port
response = await http_client.fetch(f"http://localhost:{port}/metrics")
assert response.code == 200
assert response.headers["Content-Type"] == "text/plain; version=0.0.4"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to assert the version? I'd think it's okay if that changes?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly added this for coherence with other tests assuming someone had a reason to add this. I don't mind dropping this everywhere.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped during refactoring.

family.name: family for family in text_string_to_metric_families(txt)
family.name: family
for family in text_string_to_metric_families(txt)
if family.name.startswith("dask_scheduler_")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implies to me that worker metrics are now getting mixed in when you hit the scheduler port in tests? Probably because they're all running on the same async server?

I wonder if we need to use an async-friendly prometheus client, like https://github.com/hynek/prometheus-async or, more likely, https://github.com/claws/aioprometheus (since it seems to support creating multiple Service instances per process)?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is less about worker metrics (though I am not 100% sure that's not happening), but more about additional Python-related metrics that we collect but don't want to test for in here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need to use an async-friendly prometheus client, like https://github.com/hynek/prometheus-async or, more likely, https://github.com/claws/aioprometheus (since it seems to support creating multiple Service instances per process)?

I'm open to it, but that's outside of the scope of this PR.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implies to me that worker metrics are now getting mixed in when you hit the scheduler port in tests? Probably because they're all running on the same async server?

They shouldn't, we have different collectors on different server endpoints. They also use worker/scheduler-specific methods to gather metrics to display.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Async clients look like an interesting thing but the actual value they provide is a bit unclear to me.

I guess this can make the collection a bit smoother (currently it's a plain generator) such that it doesn't block the event loop during collection. Once we have more experience with this we can revisit this. Last time I was using prometheus I didn't encounter any loop instabilities

Comment on lines +124 to +126
# request data twice since there once was a case where metrics got registered multiple times resulting in
# prometheus_client errors
await fetch_metrics()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still necessary? We're just making sure it doesn't error the second time?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the behavior we currently have on main. I'm open to changing this but I also lack context as to why this has been necessary at some point in time.


http_client = AsyncHTTPClient()

async def fetch_metrics():

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like fetch_metrics could be factored out to a utils_test function?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread distributed/http/worker/tests/test_worker_http.py Outdated
Comment thread distributed/http/worker/tests/test_worker_http.py Outdated
Comment thread distributed/http/worker/tests/test_worker_http.py Outdated
@fjetter

fjetter commented Aug 30, 2022

Copy link
Copy Markdown
Member

@gjoseph92 I leave merging to you

@hendrikmakait hendrikmakait force-pushed the improve-worker-metric-collector-testing branch from 801c4cd to 51f9781 Compare August 31, 2022 08:27
@fjetter

fjetter commented Aug 31, 2022

Copy link
Copy Markdown
Member

merging once CI is done

@hendrikmakait

Copy link
Copy Markdown
Member Author

CI flakes: #4955, #6967 (hasn't been merged into this branch IIRC)

@fjetter fjetter merged commit cefd9db into dask:main Aug 31, 2022
gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve testing of WorkerMetricCollector

3 participants